Project Python Foundations: FoodHub Data Analysis¶

Marks: 60

Context¶

The number of restaurants in New York is increasing day by day. Lots of students and busy professionals rely on those restaurants due to their hectic lifestyles. Online food delivery service is a great option for them. It provides them with good food from their favorite restaurants. A food aggregator company FoodHub offers access to multiple restaurants through a single smartphone app.

The app allows the restaurants to receive a direct online order from a customer. The app assigns a delivery person from the company to pick up the order after it is confirmed by the restaurant. The delivery person then uses the map to reach the restaurant and waits for the food package. Once the food package is handed over to the delivery person, he/she confirms the pick-up in the app and travels to the customer's location to deliver the food. The delivery person confirms the drop-off in the app after delivering the food package to the customer. The customer can rate the order in the app. The food aggregator earns money by collecting a fixed margin of the delivery order from the restaurants.

Objective¶

The food aggregator company has stored the data of the different orders made by the registered customers in their online portal. They want to analyze the data to get a fair idea about the demand of different restaurants which will help them in enhancing their customer experience. Suppose you are hired as a Data Scientist in this company and the Data Science team has shared some of the key questions that need to be answered. Perform the data analysis to find answers to these questions that will help the company to improve the business.

Data Description¶

The data contains the different data related to a food order. The detailed data dictionary is given below.

Data Dictionary¶

  • order_id: Unique ID of the order
  • customer_id: ID of the customer who ordered the food
  • restaurant_name: Name of the restaurant
  • cuisine_type: Cuisine ordered by the customer
  • cost: Cost of the order
  • day_of_the_week: Indicates whether the order is placed on a weekday or weekend (The weekday is from Monday to Friday and the weekend is Saturday and Sunday)
  • rating: Rating given by the customer out of 5
  • food_preparation_time: Time (in minutes) taken by the restaurant to prepare the food. This is calculated by taking the difference between the timestamps of the restaurant's order confirmation and the delivery person's pick-up confirmation.
  • delivery_time: Time (in minutes) taken by the delivery person to deliver the food package. This is calculated by taking the difference between the timestamps of the delivery person's pick-up confirmation and drop-off information

Let us start by importing the required libraries¶

In [2]:
# import libraries for data manipulation
import numpy as np
import pandas as pd

# import libraries for data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# importing plotly
import plotly.express as px

# to restrict the float value to 3 decimal places
pd.set_option('display.float_format', lambda x: '%.3f' % x)

import warnings
warnings.filterwarnings('ignore')

Understanding the structure of the data¶

In [4]:
# read the data
df = pd.read_csv('foodhub_order.csv')
# returns the first 5 rows
df.head()
Out[4]:
order_id customer_id restaurant_name cuisine_type cost_of_the_order day_of_the_week rating food_preparation_time delivery_time
0 1477147 337525 Hangawi Korean 30.750 Weekend Not given 25 20
1 1477685 358141 Blue Ribbon Sushi Izakaya Japanese 12.080 Weekend Not given 25 23
2 1477070 66393 Cafe Habana Mexican 12.230 Weekday 5 23 28
3 1477334 106968 Blue Ribbon Fried Chicken American 29.200 Weekend 3 25 15
4 1478249 76942 Dirty Bird to Go American 11.590 Weekday 4 25 24

Observations:¶

Per above, the DataFrame has 9 columns. Columns for each row reprsents the attributes of the order placed by a customer to the food aggregator company

Question 1: How many rows and columns are present in the data? [0.5 mark]¶

In [5]:
# Write your code here
print("There are", df.shape[0], 'rows and', df.shape[1], "columns.")
There are 1898 rows and 9 columns.

Observations:¶

  • The Dataset has 1898 number of rows with 9 columns

Question 2: What are the datatypes of the different columns in the dataset? (The info() function can be used) [0.5 mark]¶

In [6]:
# Use info() to print a concise summary of the DataFrame
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1898 entries, 0 to 1897
Data columns (total 9 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   order_id               1898 non-null   int64  
 1   customer_id            1898 non-null   int64  
 2   restaurant_name        1898 non-null   object 
 3   cuisine_type           1898 non-null   object 
 4   cost_of_the_order      1898 non-null   float64
 5   day_of_the_week        1898 non-null   object 
 6   rating                 1898 non-null   object 
 7   food_preparation_time  1898 non-null   int64  
 8   delivery_time          1898 non-null   int64  
dtypes: float64(1), int64(4), object(4)
memory usage: 133.6+ KB

Observations:¶

  • There are 4 object columns, 4 integer columns & one Float column
  • rating column, because of the presence of non-numeric data, is set as Object data type

Question 3: Are there any missing values in the data? If yes, treat them using an appropriate method. [1 mark]¶

In [7]:
# Write your code here
# checking missing values
df.isnull().sum()
Out[7]:
order_id                 0
customer_id              0
restaurant_name          0
cuisine_type             0
cost_of_the_order        0
day_of_the_week          0
rating                   0
food_preparation_time    0
delivery_time            0
dtype: int64

Observations:¶

  • There are no missing or Null values in the dataset
  • The rating column however, does have a value 'Not given' that requires some attention.
  • There could be more than one interpretation of 'Not given'. It could be that Customer chose not to give rating or it could be to represent missing rating data.
  • To enable us to use dataset that contains only those rows where the rattings were available, we need to create a new dataset that is a copy of only those rows where rattings were provided
  • Once we remove rows with "Not given" in the rating column, we are only left with rows that contain descrete numbers in the rating column so that column can than be coverted as int for further analysis
  • Please note that we are not removing any rows from the orginal dataframe, just crearing another copy of that to allow some quantiative analys of rating data with rating column as int where rating was provided
In [9]:
#1 Filter the rated restaurants
df_rated = df[df['rating'] != 'Not given'].copy()
#2 convert rating column from object to integer
df_rated['rating'] = df_rated['rating'].astype('int')

Question 4: Check the statistical summary of the data. What is the minimum, average, and maximum time it takes for food to be prepared once an order is placed? [2 marks]¶

In [10]:
# Write your code here
df.describe(include= 'all').T
Out[10]:
count unique top freq mean std min 25% 50% 75% max
order_id 1898.000 NaN NaN NaN 1477495.500 548.050 1476547.000 1477021.250 1477495.500 1477969.750 1478444.000
customer_id 1898.000 NaN NaN NaN 171168.478 113698.140 1311.000 77787.750 128600.000 270525.000 405334.000
restaurant_name 1898 178 Shake Shack 219 NaN NaN NaN NaN NaN NaN NaN
cuisine_type 1898 14 American 584 NaN NaN NaN NaN NaN NaN NaN
cost_of_the_order 1898.000 NaN NaN NaN 16.499 7.484 4.470 12.080 14.140 22.297 35.410
day_of_the_week 1898 2 Weekend 1351 NaN NaN NaN NaN NaN NaN NaN
rating 1898 4 Not given 736 NaN NaN NaN NaN NaN NaN NaN
food_preparation_time 1898.000 NaN NaN NaN 27.372 4.632 20.000 23.000 27.000 31.000 35.000
delivery_time 1898.000 NaN NaN NaN 24.162 4.973 15.000 20.000 25.000 28.000 33.000

Observations:¶

  • The above table displays the descriptive analysis for Numerical columns
  • It can be observed that the average time to prepare the food is around 27.37 min i.e 27 min & 22 sec.
  • The mimimum time to prepare the food is 20 min
  • The maximum time to prepare the food is 35 min
  • The median time to prepare the food 27 min, almost the same as mean, which means the food preparation time following Normal distribution.

Question 5: How many orders are not rated? [1 mark]¶

In [11]:
# Write the code her
unrated_count = len(df[df['rating']=='Not given'])
unrated_percent = round(100*unrated_count/df.shape[0],2)
print(f'Out of a total of {df.shape[0]} orders, {unrated_count}, which is about {unrated_percent}%, are not rated' )
Out of a total of 1898 orders, 736, which is about 38.78%, are not rated
In [12]:
df['rating'].unique()
Out[12]:
array(['Not given', '5', '3', '4'], dtype=object)

Observations:¶

736 (~38.8%) Orders are not rated. The unique() function is called to double check to ensure that was no other way the absence of rating was reported

Exploratory Data Analysis (EDA)¶

Univariate Analysis¶

Question 6: Explore all the variables and provide observations on their distributions. (Generally, histograms, boxplots, countplots, etc. are used for univariate exploration.) [9 marks]¶

In [13]:
# Write the code here
df['restaurant_name'].nunique()
Out[13]:
178

Observations On Restaurants:¶

There were 178 unique resutaurants where orders were placed, lets see find out out how they stake in terms of number of times they show up in the dataset

In [14]:
plt.figure(figsize=(75,55))
fig = px.histogram(df, x='restaurant_name', color="cuisine_type", labels={
                     "restaurant_name": "Restaurant Name",
                     "count":"count"
                 },
                title="Orders per Restaurant").update_xaxes(categoryorder='total descending')
fig.show()
<Figure size 7500x5500 with 0 Axes>
  • Shake Shack has the highest order count
  • A very disproportionate number of orders are from a handful of restaurants, indicating a need to look into the root cause of that & take appropriate actons to address the root cause. e.g, more marketing & incentivizing to those restuarnats with so few orders as well as their customers to use this order delivery service
  • As ilustrrated below, If you hover mouse over the lower end of the second highest order count restaurant, "The Meat Ball Shop", you will notice that it appears to be the only restaurant here that serves more than one Cusine Type. In this case it's both American & Italian Multi-Cusine-Restaurant.jpg

Observations On Cuisine Types:¶

In [16]:
print(f"There are a total of {df['cuisine_type'].nunique()} unique cuisine types:")
There are a total of 14 unique cuisine types:

List of Unique Cuisine Types are as follows:

In [17]:
print(df['cuisine_type'].unique())
['Korean' 'Japanese' 'Mexican' 'American' 'Indian' 'Italian'
 'Mediterranean' 'Chinese' 'Middle Eastern' 'Thai' 'Southern' 'French'
 'Spanish' 'Vietnamese']
In [18]:
df['cuisine_type'].value_counts()
Out[18]:
American          584
Japanese          470
Italian           298
Chinese           215
Mexican            77
Indian             73
Middle Eastern     49
Mediterranean      46
Thai               19
French             18
Southern           17
Korean             13
Spanish            12
Vietnamese          7
Name: cuisine_type, dtype: int64
In [19]:
plt.figure(figsize=(12,8))
sns.countplot(data=df, x='cuisine_type', order = df['cuisine_type'].value_counts().index)
plt.xticks(rotation=60)
plt.xlabel('Cuisine Type')
plt.ylabel('Order Count');
  • American, Japanese, Italian & Chinese Cusisine Types, in that order, outnumbered other cuisine types
  • Vietnamese cuisine type has the lowest count

Observations On Order Costs:¶

In [20]:
plt.figure(figsize=(12,8))
sns.histplot(data=df, x='cost_of_the_order', kde=True, bins=35)
plt.show()

The tallest bin ranges between \$12 & \$13, indicating the range of order cost with higherst number of orders

In [21]:
sns.boxplot(data=df,x='cost_of_the_order')
plt.show()

From the box plot, it can be observed that the order cost distribution is right skewed & there are No outliers. The median cost is around \$14 and the 1st quartlie is around \$12 and 3rd quartile is around \$22.5. The inner quartile range is 22.5-12 ~ \$10. Min is around \$4 dollars and max is around \$35

Observations On Day Of The Week:¶

In [22]:
plt.figure(figsize=(12,8))
sns.countplot(data=df, x='day_of_the_week')
Out[22]:
<Axes: xlabel='day_of_the_week', ylabel='count'>

The orders on Weekends are routhly 2.5 times of the Weekday orders. Since folks during weekdays tend to go for something quick, reducing the overall delivery time can positively contribute to Weekay order counts

Observations On rating:¶

In [23]:
print(f"There are a total of {df['rating'].nunique()} unique Rating Entries in the data set:")
There are a total of 4 unique Rating Entries in the data set:
In [24]:
ratingentries = df['rating'].unique()
print("List of Unique Rating Entries are", ratingentries)
List of Unique Rating Entries are ['Not given' '5' '3' '4']
In [25]:
plt.figure(figsize=(12,8))
sns.countplot(data=df, x='rating',order = df['rating'].value_counts().index);

Observations On Food Preparation Time:¶

In [26]:
plt.figure(figsize=(12,8))
sns.histplot(data=df, x='food_preparation_time', kde=True)
plt.xlabel('Food Prep. Time')
plt.title('Distribution of Food Preparation Time')
plt.show()

There appears to be no obvious pattern on Food Prep time by itself. Further analysis in conjunction with other columns may provide some insight

In [27]:
sns.boxplot(data=df,x='food_preparation_time')
plt.xlabel('Food Preparation time')
plt.show()

Median food preparation time is ~27 minutes. Max is 35 minutes. Min is 20 minutes. Data Inter Quartile Range is about 31-23 = 8 minutes

In [28]:
plt.figure(figsize=(12,8))
sns.histplot(data=df, x='delivery_time', kde=True)
plt.xlabel('Delivery Time')
plt.title('Delivery Time Distribution')
plt.show()

Large number of orders are getting delivered between around ~24 to ~29 minutes

In [29]:
sns.boxplot(data=df,x='delivery_time')
plt.xlabel('Delivery Time')
plt.show()

Median delivery time is 25 mins, min = 15 mins and max = 33 mins. The IQR is 7.5 mins. Food delivery time is left skewed & there are No outliers

Observations On Order ID¶

In [30]:
df['order_id'].nunique()
Out[30]:
1898

Since The number of Unique Order IDs is same as the total number of rows, this dataset only contains unquie orders so earch row exactly represents a unique order and there are No duplicate orders present

Observations On Customer ID & Repeat Customers¶

In [31]:
df['customer_id'].nunique()
Out[31]:
1200
In [33]:
#Lets check how is the distribution of order counts of these 1200 unique customers are
df_cust_id__order_freq_count = df.groupby(['customer_id'] , as_index=False)['customer_id'].value_counts()
df_cust_id__order_freq_count.groupby(['count'])['count'].value_counts()
Out[33]:
count  count
1      1        784
2      2        267
3      3         83
4      4         33
5      5         19
6      6          6
7      7          4
8      8          1
9      9          1
10     10         1
13     13         1
Name: count, dtype: int64
In [34]:
plt.figure(figsize=(12,8))
sns.countplot(data=df_cust_id__order_freq_count, x='count');
plt.title('Distribution of Unique Customer Order Frequency')
plt.xlabel('Unique Customer Occurrences')
plt.ylabel('Order Count');
  • Out of 1200 unique customers, 784 (65%) are one time customers, meaning they only used this service just once. i.e only 35% are repeat customers
  • 267 unique customers have ordered twice
  • 83 customers have placed orders 3 times
  • There is exactly one customer for each of the catogory of those who ordered 8, 9, 10 & 13 times

The fact that 784 customers only ordered once needs to be examined & addressed

Question 7: Which are the top 5 restaurants in terms of the number of orders received? [1 mark]¶

Answer:¶

  1. Shake Shack
  2. The Meatball Shop
  3. Blue Ribbon Sushi
  4. Blue Ribbon Fried Chicken
  5. Parm
In [35]:
# Write the code here
top5 = df.groupby(['restaurant_name'])['restaurant_name'].value_counts().sort_values(ascending=False)[:5]
top5
Out[35]:
restaurant_name            restaurant_name          
Shake Shack                Shake Shack                  219
The Meatball Shop          The Meatball Shop            132
Blue Ribbon Sushi          Blue Ribbon Sushi            119
Blue Ribbon Fried Chicken  Blue Ribbon Fried Chicken     96
Parm                       Parm                          68
Name: restaurant_name, dtype: int64
In [36]:
# We can cross check the above, by hovering mouse over the first 5 highest counts in the below histogram
plt.figure(figsize=(75,55))
fig = px.histogram(df, x='restaurant_name', color="restaurant_name", labels={
                     "restaurant_name": "Restaurant Name",
                     "count":"count"
                 },
                title="Orders per Restaurant").update_xaxes(categoryorder='total descending')
fig.show();
<Figure size 7500x5500 with 0 Axes>

Observations:¶

Question 8: Which is the most popular cuisine on weekends? [1 mark]¶

Answer:¶

American

In [38]:
# Write the code here

df_weekend = df[df['day_of_the_week'] == 'Weekend']
df_weekend
df_weekend.groupby(['cuisine_type'])['cuisine_type'].value_counts().sort_values(ascending=False)[:1]
Out[38]:
cuisine_type  cuisine_type
American      American        415
Name: cuisine_type, dtype: int64

Observations:¶

Question 9: What percentage of the orders cost more than 20 dollars? [2 marks]¶

Answer:¶

29.24%

In [39]:
# Write the code here
total_num_of_orders_over_20_dollars = df.loc[df['cost_of_the_order'] > 20.0].count().iloc[0]
percent_of_orders_over_20_dollars = round(100 * total_num_of_orders_over_20_dollars / df.shape[0],2)
percent_of_orders_over_20_dollars
Out[39]:
29.24

Observations:¶

Question 10: What is the mean order delivery time? [1 mark]¶

In [40]:
# Write the code here

print('The mean delivery time for this dataset is', round(df['delivery_time'].mean(),2), 'minutes')
The mean delivery time for this dataset is 24.16 minutes

Answer:¶

24.16 Minutes

Observations:¶

Question 11: The company has decided to give 20% discount vouchers to the top 3 most frequent customers. Find the IDs of these customers and the number of orders they placed. [1 mark]¶

Answer¶

Customer IDs of Top 3 most fequent customer on thar order are:

  • 52832
  • 47440
  • 83287
In [41]:
# Write the code here
Top3Customers = df.groupby(['customer_id'])['customer_id'].value_counts().sort_values(ascending=False).head(3)
Top3Customers
Out[41]:
customer_id  customer_id
52832        52832          13
47440        47440          10
83287        83287           9
Name: customer_id, dtype: int64

Observations:¶

Multivariate Analysis¶

Question 12: Perform a multivariate analysis to explore relationships between the important variables in the dataset. (It is a good idea to explore relations between numerical variables as well as relations between numerical and categorical variables) [10 marks]¶

In [42]:
plt.figure(figsize=(12,8))
sns.countplot(data=df, x='cuisine_type', hue='day_of_the_week', order = df['cuisine_type'].value_counts().index)
plt.xticks(rotation=60)
plt.legend(title='Day of the week')
plt.tight_layout()
plt.xlabel('Cuisine Type')
plt.ylabel('Order Count')
plt.show();

Observation

  • There are lot more orders for the weekends than the weekdays for every cuisine types
  • Popularity ranking remains the same regardless of weekday vs weekend

Cuisine vs Cost of the order¶

In [43]:
# Relationship between cost of the order and cuisine type
plt.figure(figsize=(15,7))
fig=px.box(df, x = "cuisine_type", y = "cost_of_the_order", color="cuisine_type", labels={
                     "cuisine_type": "Cuisine Type",
                     "cost_of_the_order":"cost of the order"
                 },
                title="Order Cost By Cuisine Type")
fig.show()
<Figure size 1500x700 with 0 Axes>

Observation

Mexican, Indian, and Middle Eastern have low order counts relative to the "big 4" - Japanese, American, Italian, and Chinese. Japanese, Mexican, American, Indian, Italian, Chinese, and Middle Eastern have similar order cost distributions Korean, Mediterranean, and Vietnamese have low order counts & all have outliers

Lets analyse food prep time per cusine type

In [44]:
plt.figure(figsize=(20,15))
fig = px.box(df, x="food_preparation_time", y="cuisine_type", color="cuisine_type", labels={
                     "food_preparation_time": "food prep time",
                     "cuisine_type": "Cuisine Type" },
                title="Cusine Type & Food Prep Time")
fig.show();
<Figure size 2000x1500 with 0 Axes>

Observation

  • Mediterranean, Italian, Indian, American and Mexican have similar preparation time distribution.
  • The maximum prep times for Thai & Southern appear to be higher than others
  • The less fequently ordered cuisine types have more variations in the prep times

Lets analyse food delivery time per cusine type

In [45]:
plt.figure(figsize=(20,15))
fig = px.box(df, x="delivery_time", y="cuisine_type", color="cuisine_type", labels={
                     "delivery_time": "Delivery Time",
                     "cuisine_type": "Cuisine Type" },
                title="Cusine Type & Delivery Time")
fig.show();
<Figure size 2000x1500 with 0 Axes>

Observation

  • Delivery Times have lot more variations than prep time
  • With the exception of Koreans, Vietnamese & French, the delivery Times for all others follow similar distributions
  • Cusines with lot more orders have consistent Delivery Times
In [46]:
# Relationship between rating and cuisine type
plt.figure(figsize=(20,15))
fig = px.box(df_rated, x="cuisine_type", y="rating", color="cuisine_type", labels={
                     "cuisine_type": "cuisine_type",
                     "rating": "rating" },
                title="Rating by Cusine Type")
fig.show();
<Figure size 2000x1500 with 0 Axes>

Observation

With the exception of Mediterranean, Spanish, Korean & Vietnamese, all others follow similar distribution Vietnamese is the only one having its IQR of ratings spead the widest from 3 to 5

Day of the Week vs Delivery time¶

In [79]:
# Relationship between day of the week and delivery time
plt.figure(figsize=(20,15))
fig = px.box(df, x="day_of_the_week", y="delivery_time", color="day_of_the_week", labels={
                     "delivery_time": "delivery time",
                     "day_of_the_week":"day of the week"
                 },
                title="Delivery times by day of the Week")
fig.show();
<Figure size 2000x1500 with 0 Axes>

Weekday has more delivery times, possilbly due to heavier weekday traffic compared to weekends. Howerver, the IRQ for weekdays are shorter than that of weekend, pointing to somewhat more variability in weekend traffic compared to weekdays

Rating vs Food preparation time¶

In [81]:
# Relationship between rating and food preparation time
plt.figure(figsize=(15, 7))
sns.pointplot(data=df_rated, x='rating', y='food_preparation_time')
plt.show();

It is understandble to see that higher food preparation times correspond to lower order ratings

Orders with rating 3 has a huge variation in food prep time. Perhaps it is that variation that is contributing to lower ratting, possibly not meeting customers expectations compared to those with rating 4 & 5

In [80]:
# Relationship between rating and cost of the order
plt.figure(figsize=(15, 7))
sns.pointplot(data=df_rated, y='cost_of_the_order', x='rating')
plt.show()

Observation: The cost varies much more for lower ratted orders & as the rating goes higher the cost distibutions narrows¶

Correlation among variables¶

In [60]:
# Plot the heatmap
col_list = ['cost_of_the_order', 'food_preparation_time', 'delivery_time']
plt.figure(figsize=(15, 7))
sns.heatmap(df[col_list].corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
plt.show()

Observation¶

Per the above heatmaps, there appears to be no correlation between the numeric data points

In [61]:
# let's create a column with a sum of food preparation time and the food delivery time to get to the total ourder fulfillment time
df_rated['order_fulfillement_time']= df_rated['food_preparation_time'] + df_rated['delivery_time']
df.head()
Out[61]:
order_id customer_id restaurant_name cuisine_type cost_of_the_order day_of_the_week rating food_preparation_time delivery_time
0 1477147 337525 Hangawi Korean 30.750 Weekend Not given 25 20
1 1477685 358141 Blue Ribbon Sushi Izakaya Japanese 12.080 Weekend Not given 25 23
2 1477070 66393 Cafe Habana Mexican 12.230 Weekday 5 23 28
3 1477334 106968 Blue Ribbon Fried Chicken American 29.200 Weekend 3 25 15
4 1478249 76942 Dirty Bird to Go American 11.590 Weekday 4 25 24
In [8]:
plt.figure(figsize=(10,5))
sns.pointplot(data=df_rated,y='order_fulfillement_time',x='rating')
plt.show()

Observation:

  • Orders with higher order fulfillement time, as you would expect are rated lower
  • Order that are rated as 3 has lot more varation in order fulfillement times

Lets come up with some derived numbers based on following assumptions:

Order Fulfillment time = prep time + delivery time

Analyse the the relation of delivery time to other data points to see if there is an opportunity to make some decisions based on delivery times such as reduce cost & / or improve rating

While we don't have data, batching of deliveries from same restaurants need to be availed

Look for opportunities to batch deliveries from same restaurants.

With knowledge of order cost per delivery time, an informed decistion can be made as to what's the minimum order cost allowed to use this service

Important order metrics

cost_per_delivery_time = order cost / delivery time (\$/minute)

Prep time: Calculated per restaurant, per cuisine, overall

Used to prevent deliverers from having to wait for delivery to be ready. Order ratio = # orders / restaurant(s) (greater is better) Can be calculated per restaurant, per cuisine, overall Used to identify busier restaurants where deliveries might be batched.

In [62]:
# create new df with added order_fulfillment_time and cost_per_delivery_time columns
order_analytics_df = df
order_analytics_df['order_fulfillment_time'] = order_analytics_df['food_preparation_time']+order_analytics_df['delivery_time']
order_analytics_df['cost_per_delivery_time'] = order_analytics_df['cost_of_the_order']/order_analytics_df['delivery_time']
order_analytics_df.describe().T
Out[62]:
count mean std min 25% 50% 75% max
order_id 1898.000 1477495.500 548.050 1476547.000 1477021.250 1477495.500 1477969.750 1478444.000
customer_id 1898.000 171168.478 113698.140 1311.000 77787.750 128600.000 270525.000 405334.000
cost_of_the_order 1898.000 16.499 7.484 4.470 12.080 14.140 22.297 35.410
food_preparation_time 1898.000 27.372 4.632 20.000 23.000 27.000 31.000 35.000
delivery_time 1898.000 24.162 4.973 15.000 20.000 25.000 28.000 33.000
order_fulfillment_time 1898.000 51.534 6.834 35.000 47.000 52.000 56.000 68.000
cost_per_delivery_time 1898.000 0.719 0.381 0.153 0.451 0.614 0.918 2.202

Observations:

  • Average delivery time is 24.2 minutes.
  • The Min order cost_per_delivery_time is 15.3 cents/minute
  • Mean cost_per_delivery_time is 71.9 cents/minute
  • Max order cost_per_delivery_time is \$2.20/minute
In [63]:
#sns.histplot(data=order_analytics_df,x='cuisine_type',y="cost_per_delivery_time", color='cusine_type')
#plt.xticks(rotation=90);

fig = px.density_heatmap(
    data_frame=order_analytics_df, x="cuisine_type", y="cost_per_delivery_time", color_continuous_scale="PuBu", labels={
                     "cuisine_type": "cuisine type",
                     "cost_per_delivery_time":"cost per delivery time"
                 },
                title="cost per delivery time by cuisine type")

fig.show()
  • Due to the fact that American, Japanese, Italian, and Chinese have the most orders, they have the densest cost per delivery times.
  • The above can be used for comparing relative performance of other Cuisines
In [82]:
order_analytics_df.sort_values(by='cost_per_delivery_time', ascending=True, inplace=True)

plt.figure(figsize=(20,15))
fig = px.box(order_analytics_df, x="cost_per_delivery_time", y="cuisine_type", color="cuisine_type", labels={
                     "cost_per_delivery_time": "cost per delivery time",
                     "cuisine_type": "Cuisine Type" },
                title="Cost Per Delivery by Cusine Type")
fig.show();
<Figure size 2000x1500 with 0 Axes>

Observations:

  • Cuisines with fewer orders show greater variance in cost per delivery time.
  • Cuisines with larger numbers of orders tend to skew right and have more high cost per delivery time outliers.
  • American, Japanese, Chinese, Italian have greatest number of orders and have very similar profiles.

Revenue & Order Profitability

In [83]:
# Lets check how many orders were 5 or below where company has no revenue, meaning its losing money for those orders, given the cost to deliver is there
orders_with_zero_revenue = len(df[df['cost_of_the_order'] < 6 ])
print(f'{orders_with_zero_revenue} orders were less than $6, contributing to 0 revenue')

revenue_with_cost_6_to_10 = df[ (df['cost_of_the_order'] < 11) & (df['cost_of_the_order'] > 5) ] ['revenue'].sum()
order_count_with_cost_6_to_10 = len(df[ (df['cost_of_the_order'] < 11) & (df['cost_of_the_order'] > 5) ])
revenue_per_order_with_cost_6_10 = round(revenue_with_cost_6_to_10 / order_count_with_cost_6_to_10,2)

print(f'{order_count_with_cost_6_to_10} orders were between $6 & $10, with revenue per order at: ${revenue_per_order_with_cost_6_10}')

revenue_with_cost_11_to_20 = df[ (df['cost_of_the_order'] < 21) & (df['cost_of_the_order'] > 10) ] ['revenue'].sum()
order_count_with_cost_11_to_20 = len(df[ (df['cost_of_the_order'] < 21) & (df['cost_of_the_order'] > 10) ])
revenue_per_order_with_cost_11_20 = round(revenue_with_cost_11_to_20 / order_count_with_cost_11_to_20,2)

print(f'{order_count_with_cost_11_to_20} orders were between $11 & $20, with revenue per order at: ${revenue_per_order_with_cost_11_20}')
79 orders were less than $6, contributing to 0 revenue
395 orders were between $6 & $10, with revenue per order at: $3.24
972 orders were between $11 & $20, with revenue per order at: $3.15

Question 13: The company wants to provide a promotional offer in the advertisement of the restaurants. The condition to get the offer is that the restaurants must have a rating count of more than 50 and the average rating should be greater than 4. Find the restaurants fulfilling the criteria to get the promotional offer. [3 marks]¶

Answer:¶

  1. The Meatball Shop
  2. Blue Ribbon Fried Chicken
  3. Shake Shack
  4. Blue Ribbon Sushi

Strategy: Due to the presence of 'Not given' in rating column, the rating column is currently set to Object type. To carry out the requsted work, that column needs to be Number type, like int. The Strategy will be to:¶

  1. Create a new dataframe from the original dataframe minus all the rows that contain"Not given" in the rating column.
  2. In the new datafame, the rating column data type can than be coverted to number type, i.e int
  3. From the the new dataframe, we can create another dataframe containing only the restaurant names & its corresponding rating counts.
  4. Filter only those restuarants that has received at least 50 ratings
  5. Go back to the full dataset containing ratings & filter it so that it only contains the restaurant names from previous steps
  6. Find out which are the ones having mean ratings of over 4
In [84]:
# Write the code here

#1 Filter the rated restaurants
df_rated = df[df['rating'] != 'Not given'].copy()
#2 convert rating column from object to integer
df_rated['rating'] = df_rated['rating'].astype('int')
#3 create dataframe containing only the restaurant name & its corresponding rating counts & update the column label
df_rating_count = df_rated.groupby(['restaurant_name'])['rating'].count().sort_values(ascending = False).reset_index()
# update the column label to reflect that it now shows the rating count
df_rating_count.rename(columns = {'rating':'rating count'}, inplace = True)
#4 filter to get the data of restaurants that have rating count more than 50
rest_over_50_ratings = df_rating_count[df_rating_count['rating count']>50]['restaurant_name']
#5 Filter the rated data so that it only contains the restaurant names from prev steps
rest_data_with_over_50_ratings = df_rated[df_rated['restaurant_name'].isin(rest_over_50_ratings)].copy()
#6 Extract the names of the restaurants who mean ratings are over 4
rest_names_over_4_mean_rating = rest_data_with_over_50_ratings.groupby(rest_data_with_over_50_ratings['restaurant_name'])['rating'].mean().sort_values(ascending = False).reset_index()['restaurant_name']
print(f'Restaurants that have a rating count of more than 50 and the average rating greater than 4 are:\n{rest_names_over_4_mean_rating}')
Restaurants that have a rating count of more than 50 and the average rating greater than 4 are:
0            The Meatball Shop
1    Blue Ribbon Fried Chicken
2                  Shake Shack
3            Blue Ribbon Sushi
Name: restaurant_name, dtype: object

Observations:¶

Question 14: The company charges the restaurant 25% on the orders having cost greater than 20 dollars and 15% on the orders having cost greater than 5 dollars. Find the net revenue generated by the company across all orders. [3 marks]¶

Answer:¶

Net revenue generated by the company across all orders is $6,166

Strategy:¶

  1. Create a helper function that will return revenue for a given food cost value
  2. Extract a series that contains costs
  3. Iterate through each row of the costs series
  4. Populate a new Column "Revenue" with the calculated revenue for each row
  5. Calculate the total revenue by taking the sum of "Revenue" column
In [85]:
# Write the code here
#1 Following function returns the calculated revenue based on food cost
def calc_revenue(a):
    if a > 20:
        return a*0.25
    elif a > 5:
        return a*0.15
    else:
        return 0
In [68]:
#2 Extract a series that contains costs
costs = df['cost_of_the_order']
#3 Iterate through each row of the costs series
for i, cost_value in enumerate(costs):
  #4 Populate a new Column "Revenue" with the calculated revenue for each row
  df.loc[i,['revenue']] = calc_revenue(cost_value)
#5 Calculate the total revenue by taking the sum of "Revenue" column
net_revenue = df['revenue'].sum()
print(f"Net revenue generated by the company across all orders is ${net_revenue}")
Net revenue generated by the company across all orders is $6166.303

Observations:¶

Question 15: The company wants to analyze the total time required to deliver the food. What percentage of orders take more than 60 minutes to get delivered from the time the order is placed? (The food has to be prepared and then delivered.) [2 marks]¶

Answer:¶

10.54% of orders take more than 60 minutes to get delivered

In [86]:
# Write the code here
# let's create a column with a sum of food preparation time and the food delivery time to get to the total order delivery time
df['total_order_delevery_time']= df['food_preparation_time'] + df['delivery_time']
df.head()
# Extract all the rows where the newly created "total_order_delevery_time" column shows value greater than 60
df_over_60 = df[df['total_order_delevery_time'] > 60]
# Now calcualte the requested number using the above dataset row counts with that of the original data set containing all rows
percent_over_60 = round(100*df_over_60.shape[0]/df.shape[0],2)
print(f'{percent_over_60}% of orders take more than 60 minutes to get delivered')
10.54% of orders take more than 60 minutes to get delivered

Observations:¶

Question 16: The company wants to analyze the delivery time of the orders on weekdays and weekends. How does the mean delivery time vary during weekdays and weekends? [2 marks]¶

In [87]:
# Write the code here
fig = px.box(df, x="day_of_the_week", y="delivery_time", color="day_of_the_week", labels={
                     "delivery_time": "delivery time",
                     "day_of_the_week":"day of the week"
                 },
                title="Delivery times by day of the Week")
fig.show()
In [88]:
mean_dt_weekend = round(df[df['day_of_the_week']=='Weekend']['delivery_time'].mean(),2)
mean_dt_weekday = round(df[df['day_of_the_week']=='Weekday']['delivery_time'].mean(),2)
mean_delivery_time_difference = round(mean_dt_weekday - mean_dt_weekend,2)
percentage_additional_mean_dt_weekday = round(100 * mean_delivery_time_difference / mean_dt_weekday)

percentage_weekday_delivery_time = mean_dt_weekday/mean_dt_weekend
print(f'The mean delivery times:\n Weekend: {mean_dt_weekend}\n Weekday: {mean_dt_weekday}\n')
print(f'On average, it takes about {mean_delivery_time_difference} more minutes to deliver food on Weekdays ({percentage_additional_mean_dt_weekday}% longer), as compared to Weekends')
The mean delivery times:
 Weekend: 22.47
 Weekday: 28.34

On average, it takes about 5.87 more minutes to deliver food on Weekdays (21% longer), as compared to Weekends

Observations:¶

The delivery times on Weekdays are about about 21% higher, on average, compared to that on Weekdays. The side by side box plot also shows the upward shift of the delivery times distribution for the Weekdays. The delivery time Interqaurtile Ragnes are:

  • Weekends: IQR = 27 - 18.25 = 8.75 minutes
  • Weekdays: IQR = 31 - 26 = 5 minutes

Per above, the delivery times are spread much wider on Weekends than Weekdays.

Conclusion and Recommendations¶

Question 17: What are your conclusions from the analysis? What recommendations would you like to share to help improve the business? (You can use cuisine type and feedback ratings to drive your business recommendations.) [6 marks]¶

Conclusions:¶

  • Out of 1200 unique customers, 784 (65%) are one time customers, meaning they only used this service just once. i.e only 35% are repeat customers

    There exists a major variatons in orders for cusine types that can be grouped as follows by order count:

  • American & Japanese (Over 470)
  • Italian & Chinese (200 - 300)
  • Mexican, Indian, Middle Eastern, Mediterranean (45 - 80)
  • Thai, French, Southern, Korean, Spanish (10 - 20)
  • Vietnamese (7)

  • Cuisines/Restaurants with higher order count tend to have higher average order costs, with mean order cost being: $16.50.

    Revenue Per Order

  • 79 orders were less than 6 USD, contributing to 0 revenue
  • 395 orders were between 6 USD and 10 USD, with revenue per order at: 1.17 USD
  • 972 orders were between 11 USD and 20 USD, with revenue per order at: 2.21 USD

    Order Rating

  • 736 out of the 1,898 orders did not receive any customer rating
  • Ratings positively correlate to order cost
  • Ratings do not correlate to prep time
  • Delivery times impact ratings

    Operational Challenges

    • Approximately 10.5% of orders take more than an hour

Recommendations:¶

Improve Customer Loyality
  • If a one time customer does not return within an anticipated number of days (as learned from other repeat customers), generate an online incentive such as a discount coupon etc. to improve repeat customer count

    Balancing out Cuisine Type Variety

    • Add less represented Cuiine Type restaurants to attract more customer types
    • Due to these being in single digit, target additional Vietnamese, Korean, Spanish, Middle Eastern, French, Thai, Southern, Mediterranean, restaurants

Improve Revenue for low Cost Orders

  • Setup a mimimum order for delivery service to be higher than $5 to avoid money losing orders
  • Based on break even analysis, come up with a flat delivery fee for low cost orders to avoid taking on money losing orders
  • To capitalize on high order counts for some restaurants, in addition to revenue by cost, consider negotiating higher revenue by order count i.e asked to be paid a bonus by the restaurant, if you sell more in a given period for that restaurant, say N orders every week or month
  • Ensure Vietnamese and Korean restaurants have higher cost menu options

Reduce Delivery Times

  • Time delivery pickups to 25 minutes after order placement to minimize wait times
  • Introduce other means of reducing delivery times such as use e-scooters that can go where other vehicles can't
  • Batch deliveries for restaurants with high order counts.
  • Investigate & experiment with Drone Technology to see if it can be a viable option - legaly, logistically & cost wise